Prediction of arabica coffee production using artificial neural network and multiple linear regression techniques

Crop yield and its prediction are crucial in agricultural production planning. This study investigates and predicts arabica coffee yield in order to match the market demand, using artificial neural networks (ANN) and multiple linear regression (MLR). Data of six variables, including areas, productivity zones, rainfalls, relative humidity, and minimum and maximum temperature, were collected for the recent 180 months between 2004 and 2018. The predicted yield of the cherry coffee crop continuously increases each year. From the dataset, it was found that the prediction accuracy of the R2 and RMSE from ANN was 0.9524 and 0.0784 tons, respectively. The ANN model showed potential in determining the cherry coffee yields.


Results
Descriptive statistics. In this study, we provided the predicted yield of arabica coffee for 15 years from 2004 to 2018. Coffee is seasonally produced and available 6 months a year. Off-season, there is no yield. The dataset used in the study consists of factors related to crop yield, including the area, productivity zone, monthly rainfall, monthly RH, and monthly temperature. The data source is from the northern part of Thailand, which has been collected for 180 months. A summary of the statistical characteristics of the dataset is presented in Table 1. It was shown that the productivity zone had a positive skewness distribution of 0.413, and the temperature had a negative skewness distribution of − 0.467. The other variables were slightly skewed. The following subsection shows the results of the modeling. Productivity, also known as crop yield, is the variable to be predicted using the MLR. To apply MLR modeling, the correlation between pairs of input variables was examined to decide which variables to include in the model. The results are shown in Table 2.
In the MLR model, it is common to include the input variables with a high linear correlation with the response variable but to exclude ones that are highly correlated among themselves because of the multicollinearity problem. Table 2 shows that the area and productivity zone variables are highly positively correlated, while the variables Tmax and area and Tmax and RH are moderately positively and negatively correlated, respectively. These results might lead to a multicollinearity problem; however, in this study, we included all variables in the MLR, as the dataset is one hundred eighty observations and six variables. The results are shown in "Prediction results".
Prediction results. The MLR, in which the equation was derived from the full model of six variables, is used to predict crop yield. The MLR equation, R 2 , and the Durbin-Watson test are shown in Table 3. From Table 3, the model that includes six input datasets yields a high value of R 2 , which indicates that the coffee features substantially explain the change in the crop yield (92.35%). Meanwhile, the Durbin-Watson test is 1.56, which reveals a suitable autocorrelation for the predictive errors. After data analysis, it was found that the R 2 is 0.9235, RMSE is 0.0784, and MSE is 0.0061 ton, which is suitable for the model (see Fig. 3). From Fig. 3, when the MLR model tested the input data and crop yield, it can be seen that the linear relationship was preserved between the variable input datasets and the crop yield. Thus, farmers and entrepreneurs could use this data set to predict cherry coffee productivity in the next period. www.nature.com/scientificreports/ The ANN analysis was performed based on the previous 180 months' input dataset. In all ANN analyses, the input and output had eight neurons and one layer. The number of neurons in the hidden layer was defined using trial and error to find the best ANN configuration for predicting the crop yield of cherry coffee. The ANN configuration determined to be optimal should minimize the MSE and RMSE and optimize R 2 . The ANN configuration from the network topology started from one and then consecutively increased.        Table 4, the network properties will set the treatments of the (i) type of network, (ii) input and target data, (iii) training and adaptation learning function, (iv) performance function, (v) layer number, (vi) number of neurons, and (vii) transfer function. Feed-forward backpropagation was used for setting the network type. The number of hidden layers is fixed as two, numbered H1 and H2, and the number of neurons will set the number of processing elements. Then, these data will be trained in order to obtain the best ANN performance. After training the ANN model, the best of the various ANN configurations is with the second hidden layer and the eighth processing element. Meanwhile, the MSE and R 2 are 1027.99 and 0.9524 tons, respectively, as shown in Figs. 4 and 5. Figure 4 shows the MSE from the random data generated, which included the training, validation, and testing data. The learning rate and momentum were constant at 1000, 0.5, and 0.1, respectively. The results show that the learning cycles were 49 epochs, and the best validation performance was found at 43 epochs with the MSE at 1027.99. Table 5 shows the outcome of the coefficient (R 2 ) of the training and testing data phases. The neural network confirms the precise tools with perfect prediction accuracy in the responses of genotypes for coffee crop yield based on the measured indices. Thus, the ANN model was demonstrated to provide a perfect agreement between the exact and approximate coffee productivity. However, the prediction potential of productivity by ANN (0.9524) appeared to show higher accuracy than MLR (0.9235). MLR and ANN were compared through machine learning techniques using historical areas, productivity zones, rainfalls, relative humidity, and minimum and maximum temperature for the recent 180 months.
According to the study, MLR and ANN have focused on practical interpolation methods. RMSE and MSE cross-validation were accomplished to assess the efficiency of each month (168 months). It was also estimated through the average of RMSE and MSE. However, the cross-validation of MLR showed that the average values of RMSE and MSE are 0.0618 and 0.0038, respectively. Nonetheless, in comparison to the full model of coffee prediction for 180 months, RMSE was 0.0784, and MSE was 0.0061 tons, respectively. The results showed that both of them were not significantly different.

Discussion and conclusion
The crop yield of cherry coffee varies because of several factors such as the cultivated area, amount of rainfall, temperature, and RH. These conditions affect the cherry coffee yield from each month's crop. Prediction is necessary to forecast the productivity accurately and to satisfy customer demand. Previously, farmers and entrepreneurs lack knowledge regarding the future cherry coffee yield, as well as which factors can change the cherry coffee crop yield each month. For this reason, production planning to match the supply to the demand is rather difficult. However, if farmers and entrepreneurs have sufficient data and can utilize it using machine learning models, they may realize more accurate estimate of the crop yield.
In this work, MLR and ANN were used to analyze and predict the coffee crop yield using data from 2004 to 2018, in which input data for 132 months was used for training, 24 months for validation, and 24 months for testing. Calculation of the cherry coffee crop yield required MATLAB programming. The MLR equation was derived from the full model consisting of six variables. The data shows that the area (X 1 ) ranged between 42,216 and 20,158 acres with an average of 34,003 acres, playing an essential role in cherry coffee production. The available area is a crucial component for coffee tree planting because it determines the farmer's cherry coffee productivity. The productivity zone (X 2 ) shows data ranging between 37,710 and 14,513 acres, with an average of 23,921 acres. This is also essential because it indicates how much the farmer can earn from the cherry coffee yield 73 . The rainfall data (X 3 ) is from 0 to 509.8, but the average rainfall is 1830 mm per month. The amount www.nature.com/scientificreports/ of RH (X 4 ) is between 55.0 and 82.0%RH, and the mean is 76.0%RH which is suitable for coffee cultivation 74 . Moreover, the temperature (X 5 and X 6 ) is a factor that could affect coffee production and is shown in the data to be between 5.10 and 40.0 °C with an average of 25.5 °C 75 . The correlation was checked between independent variables. As in Table 2, there are some high correlations, such as those between the minimum temperature and productivity, the area and productivity zone, and the productivity zone and productivity, with values of 0.9175, 0.8119, and 0.7775, respectively. For this reason, it was necessary to put all the variables into the model or may represent them with VIF < 10 values (variance inflation factor). The assumption of MLR was adjusted and estimated from Y' = 0.2850 + 0.0402X 1 + 0.3128X 2 − 0.0486X 3 − 0.4040X 4 − 0.6144X 5 + 1.1065X 6 with 0.9235 for the standard error of estimation. The R 2 and Durbin-Watson values confirmed the appropriateness of the MLR equation, as shown in Table 3. R 2 and Durbin-Watson values were 0.9235 and 1.56, respectively which are suitable for this MLR model. Meanwhile, R 2 in Table 3 also indicates that the coffee variable substantially explains the changes in the crop yield, while the Durbin-Watson test reveals a low autocorrelation for the predictive errors.
The importance of each input variable on the yield was also analyzed based on the MLR mathematical expression. The equation coefficients showed that Tmin monthly variable (X 6 ) was the most important (1.1065). The  www.nature.com/scientificreports/ second-most important factor was Tmax monthly (X 5 ) with a coefficient of − 0.6144. The relative humidity (X 4 ), the productivity zone (X 2 ), and monthly total rainfall (X 3 ) were inferior, with values of − 0.4040, 0.3128, and − 0.0486, respectively. Lastly, the area (X 1 ) was shown to be the least essential variable (0.0402).
Similarly, for the ANN model, the relative importance of each input variable was evaluated based on oneway partial dependence plots (PDPs), as shown in Fig. 6. It can be seen that the monthly Tmax variable (X 5 ) was the most important, in which the PDP value varied from 0.45 to 0.05. Similarly, the second most influential variable was monthly total rainfall (X 3 ), with PDP values of 0.25 to 0.05. The third variable was relative humidity (X 4 ) of 0.138 to 0.103. Additionally, the last three variable groups were productivity zone (X 2 ), area (X 1 ), and Tmin monthly (X 6 ), showing the PDP values varied from 0.240 to 0.07, 0.158 to 0.118, and 0.160 to 0.110, respectively 76,77 . In Fig. 6e, when the Tmax monthly variable increased, cheery coffee's productivity was lowered, especially when it was higher than 29 °C. In other words, if the temperature was less than 29 °C, the productivity of cheery coffee was improved. From Fig. 6b, if we have more productivity zones, productivity will increase accordingly. From Fig. 6d, the productivity would be high if the cherry coffee experienced high relative humidity. On the contrary, if the amount of rain were too large, it would negatively affect productivity. Based on the PDPs in Fig. 6c, the cherry coffee should rain at less than 100 mm per month, leading to a high crop yield.
Likewise, the ANN method can obtain the input dataset from the available data. Then, the data was imported from the MATLAB workspace, and a variable was selected. A network or data was created to random the hidden layer until the performances of the various ANN configurations were known. The best performance is shown in Table 5, with values of 0.9524 and 0.0784 ton for R 2 and RMSE, respectively. The results of both models indicate that the cherry coffee crop yield can be accurately predicted if the farmers or entrepreneurs are able to utilize these data.
In comparing MLR and ANN performances with other agricultural-related predictions, the R 2 of the MLR was better than the ANN in some cases due to its limited number of input datasets. This was so for Bayat et al. work 78 on predicting tree survival and mortality in the Hyrcanian forest. The six input variables were used for 9 years of prediction, including the wind velocity and direction, diameter at breast height, basal area of larger trees, diameter, topographic wetness index, and temperature. The R 2 of ANN and MLR were 0.6790 and 0.9200, respectively. Ustaoglu et al. 79 used MLR and ANN for temperature forecasts through daily mean, maximum, and minimum temperature in time series. The R 2 of ANN was 0.8590, lower than that of MLR (0.9025).
The R 2 of ANN was generally found to be better than that of MLR when many datasets were used, for example, in predicting rainfall 75,80 , and maize yield through the temperature and precipitation data 81 , monthly pan evaporation and soil temperatures 82,83 , the relative humidity of climate in Turkey 84 , and crop prediction 85,86 , as shown in Table 6. For the cherry coffee crop yield by MLR and ANN in this work, the R 2 were 0.9235 and 0.9524, respectively, while RMSE was 0.0784 ton, showing the suitability for crop yield prediction.
MLR and ANN were effective in forecasting the crop yield of cherry coffee. These methods can be utilized from upstream to downstream in coffee planning to respond to customer demand. Moreover, they can reduce the risk of insufficient production in the high season and improve future efficiency and effectiveness for industrial and agricultural products.
Concerning limitations, this research considered only six input datasets, which were the area, amount of rainfall, temperature, and RH that affected the efficiency of increased crop yield. More related data input such www.nature.com/scientificreports/ as farm location, irrigation water depth applied, precipitation, solar radiation, potential evapotranspiration, soil moisture, land cultivated, pH, nitrogen, phosphorus, potassium, air pressure, wind speed, and climate variables should also be considered. The crop yield prediction could be improved further [87][88][89] .
Once the crop yield and amount of cherry coffee are accurately predicted for future works, farmers or entrepreneurs can prepare their manufacturing resources for later stages of production planning accordingly. These may include (i) color sorters to classify the color shade of cherry coffee for higher added value and (ii) dryer machines to reduce the moisture content in the cherry coffee.
The ANN algorithm has been demonstrated to be effective in making predictions of production rates/yields from industries of coffee plantings. However, the dataset used may be imbalanced. It is essential to note that, in identifying the optimal granularity and refining the imbalanced dataset, combining granular computing and ML techniques is of great interest 90,91 . In general, the predictions of developed ML models are based on knowledge learning from the complex multi-dimensional relationships between features and target(s) 92 . When an imbalanced, small dataset is employed to train and develop an ML model, the developed model would have a poor ability to make predictions. An innovative approach of combining granular computing with ML or deep learning may be considered further, in which the dataset's quality and efficiency could be improved via clustering examples in the best granularity, sampling, and refining into a more balanced example set, such as those deployed in services planning under social manufacturing context and bi-level model for risk classification of respiratory diseases [90][91][92][93][94] . Additionally, reinforcement learning techniques may be considered to predict agricultural production. The learning process happens along the way with six main components of (i) agent, (ii) action, (iii) environment, (iv) state, (v) policy, and (vi) reward. It was successful for order acceptance decision of mass-individualized printed circuit board manufacturing 95 . In this research, only six main input features were considered. Therefore, reinforcement learning with more complex datasets may be considered for cherry coffee and other agricultural industries in the near future.

Materials and methods
Research framework. This study focuses on predicting cherry coffee. The scope of this study covers four steps from plantation until harvesting (see Fig. 7).
Data collection and treatment. The data from the Climate Department included rainfall, RH, and minimum and maximum temperature. The Agricultural Economics Office and Meteorological Department provided the area and productivity zone data. The input dataset consisted of the area, rainfall, temperature, and (RH), while the output consisted of the crop yield of cherry coffee. The input and output data were normalized within the range 0-1 using where N is the normalized data; X is the measured value: Xmin and Xmax are the minimum and maximum values. The ANN performance must be estimated by means of a validation dataset. The selected ANN was tested, and its performance was compared using the coefficient of determination (R 2 ), the mean squared error (MSE), and the percentage error. where Y i is the dependent variable; b 0 is a constant (intercept); X i,k is an independent variable; b k is the vector of regression coefficients (slope); and e i is random measured errors.
Data set, training and selection of optimal on figuration. The variables data were randomly split into three groups to prevent overfitting in the training step because a high number of neurons existed in the hidden layer. Training, validation, and testing of models were based on 70%, 15%, and 15% of the data, respectively 68 . In the meantime, during neural network training, the algorithm provided the minimum or maximum performance via the shortest path to yield the network's size. At the same time, backpropagation was performed through the training set in order to update the rule to seek and find the minimum mean square error over the training set. The best point of generalization form validation was obtained from a training stop when the error started to increase until the final increase point. Next, the networks were trained with the testing set, which was also compared with the output dataset with the desired output, and the model weight was fixed 96 . After the training epoch, the mean square error of the network running of the validation set was calculated. Then, the neurons will apply a specific linear function in the hidden layer and a specific linear function in a hidden layer to collect the linear combination and bias. Finally, the output data will give the predicted model.
The validated model was evaluated based on the observed and predicted coffee crop yield difference. The 180 months of data on coffee production were randomly evaluated through a k-fold cross-validation technique but evenly partitioned into 168 months of the dataset. The dataset will be further randomly evaluated through three rounds of MLR and ANN (ten processing elements and two hidden layers) training. Each dataset was individually used as validation, while the enduring ones were treated as training datasets. From the k-fold crossvalidation, RMSE and MSE were evaluated to determine the models' performance: (2) y i = b 0 + b 1 X i,1 + b 2 X i,2 + · · · + b k X i,k + e i ,  Prediction of crop yield. In this model, the crop yield was considered to assess indices through the input and output data. A partial dependence plot (PDP) was performed to evaluate each variable as an independent variable during the neural network process 76,77 . The relative importance of the parameters for each index was computed by dividing the value of its importance by the highest importance value. Consequently, the model with the largest R 2 and the smallest RMSE was considered the best. The ratios of the correlation between standard errors and mean values were calculated by evaluating the accuracy in terms of repeatability and similarity for yield-based with the maximal indices up to 1000 bootstrap samples similar to the coefficient of variation (CV). A low CV indicates a selection index that is more accurate 65 . All implementations were accomplished using the neural network toolbox in the MATLAB package. Therefore, in the present study, MATLAB version R2019b, with Windows 10 and a 64-bit machine, was used to run the MLR and ANN models.

Data availability
The datasets used and/or analyzed during the current study available from the corresponding author on reasonable request. www.nature.com/scientificreports/